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Abstract. The complexity of the Metropolis-Hastings (MH) algorithm arises from the re¬ 
quirement of a likelihood evaluation for the full data set in each iteration. [Payne and Malli(^ 


(20151 propose to speed up the algorithm by a delayed acceptance approach where the ac¬ 
ceptance decision proceeds in two stages. In the first stage, an estimate of the likelihood 
based on a random subsample determines if it is likely that the draw will be accepted and, 
if so, the second stage uses the full data likelihood to decide upon final acceptance. Eval¬ 
uating the full data likelihood is thus avoided for draws that are unlikely to be accepted. 
We propose a more precise likelihood estimator which incorporates auxiliary information 
about the full data likelihood while only operating on a sparse set of the data. We prove 
that the resulting delayed acceptance MH is more efficient compared to that of [Payne andj 


Mallick (20151. The caveat of this approach is that the full data set needs to be evaluated 


in the second stage. We therefore propose to substitute this evaluation by an estimate and 
construct a state-dependent approximation thereof to use in the first stage. This results in 
an algorithm that (i) can use a smaller subsample m by leveraging on recent advances in 
Pseudo-Marginal MH (PMMH) and (ii) is provably within 0{m~^) of the true posterior. 
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1. Introduction 


Markov Chain Monte Carlo (MCMC) methods have been the workhorse for sampling from 
nonstandard posterior distributions in Bayesian statistics for nearly three decades. Recently, 
with increasingly more complex models and/or larger data sets, there has been a surge of 
interest in improving the 0{n) complexity emerging from the necessity of a complete data 
scan in each iteration of the algorithm. 

There are a number of approaches proposed in the literature to speed up MCMC. Some 
authors divide the data into different partitions and carry out MCMC for the partitions in 
a parallel and distributed manner. The draws from each partition’s subposterior are sub¬ 
sequently combined to obtain an approximation of the full posterior distribution. This line 


of work includes Scott et al. (2013); Neiswanger et ah (2013); Wang and Dunson (2013); 


Minsker et al. (2014); Nemeth and Sherlock (2016), among others. Other authors use a sub¬ 


sample of the data in each MCMC iteration to speed up the algorithm, see e.g. Korattikara 


et al. (2014), Bardenet et al. (2014), Maclaurin and Adams (2014), Maire et al. (2015), Bar- 


denet et al. (2015) and Quiroz et al. (2016, 2017). Finally, delayed acceptance MCMC has 


been used to speed up computations (Banterle et ah, 2014, Payne and Mallick, 2015). The 


main idea in delayed acceptance is to avoid computations if there is an indication that the 


proposed draw will ultimately be rejected. Payne and Mallick (2015) consider a two stage 


delayed acceptance MCMC that uses a random subsample of the data in the first stage to 
estimate the likelihood function at the proposed draw. If this estimate suggests that the pro¬ 
posed draw is likely to be rejected, the algorithm does not proceed to the second stage that 


evaluates the true likelihood (using all data). Banterle et al. (2014) divide the acceptance 


ratio in the standard Metropolis-Bastings (MB) (Metropolis et al., 1953, Bastings, 1970) into 


several stages, which are sequentially performed until a first rejection is detected implying 
a final rejection of the proposed draw. Clearly, both delayed acceptance approaches save 
computations for proposals that are unlikely to be accepted, but on the other hand require 
all stages to be computed for those likely to be accepted. The latter corresponds to at least 
the same computational cost for one iteration as the standard MB. 
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This paper extends the delayed acceptance algorithms of Payne and Mallick (2015) in the 
following directions. First, we replace the likelihood estimator of the first stage by an efficient 
estimator that employs control variates to significantly reduce the variance. We show that 
this modification results in an algorithm which is more effective in promoting good proposals 
to the second stage. Since this algorithm computes the true likelihood (as does MH) in the 
second stage whenever a draw is up for a final accept/reject decision, we refer to it as Delayed 
Acceptance (standard) MH (DA-MH). Second, we propose a delayed acceptance algorithm 
that overcomes the caveat of a full data likelihood evaluation in the second stage by replacing 
it with an estimate. This second stage estimate is initially developed in the approximate 


subsampling Pseudo-Marginal MH (PMMH) framework in Quiroz et ah (2016). Thus, our 
second contribution is to speed up their algorithm by combining with delayed acceptance 
and we document large speedups for our so called Delayed Acceptance PMMH (DA-PMMH) 


algorithm compared to MH. Payne and Mallick (2015) instead propose to circumvent the full 
data evaluation by combining their delayed acceptance sampler with the consensus Monte 


Carlo in Scott et ah (2013), which currently lacks any control of the error produced in the 


approximate posterior. In contrast, we can make use of results in Quiroz et ah (2016) to 
control the approximation error and also ensure that it is where m is the subsample 

size used for estimating the likelihood. 

While the delayed acceptance MH has the disadvantage of using all data whenever a 
proposed draw proceeds to the second stage, we believe that the successful implementation 
provided here is essential if exact simulation is of importance. By exact simulation we 
mean that the Markov chain produced by the subsampling algorithm has the same invariant 
distribution as that of a Markov chain produced by an algorithm that uses all the data. 
Exact simulation using subsets of the data has proven to be extremely challenging. Pseudo¬ 


marginal MCMC (Beaumont, 2003, Andrieu and Roberts, 2009) provides a framework for 


conducting Markov chain simulation by only using an estimate (in this setting estimated 
from a subsample of the data) of the likelihood. Remarkably, although the true likelihood 
is never evaluated, the simulation is exact provided that the likelihood estimate is unbiased 
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and almost surely positive. One route to exact simulation by data subsampling suggested by 


Bardenet et al. (2015) is to compute a sequence of unbiased estimators of the log-likelihood, 


apply the technique in Rhee and Glynn (2015) to debias the resulting likelihood estimator and 


subsequently use it within a pseudo-marginal MCMC. Bardenet et al. (2015) note that, as 


proved by Jacob and Thiery (2015), a lower bound on the log-likelihood estimators is needed 
to ensure positiveness. This typically requires computations using the full data set, naturally 


defeating the purpose of subsampling. Quiroz et al. (2017) instead suggests to compute a soft 
lower bound which is dehned to be a lower bound with a high probability. Since the estimate 


can occasionally be negative, the pseudo-marginal in Quiroz et al. (2017) instead targets an 


absolute measure following Lyne et al. (2015), who show that the draws can be corrected with 
an importance sampling type of step to estimate expectations of posterior functions exactly. 
We note that although this certainly is exact inference of the expectation (it is not biased) the 


algorithm does not provide exact simulation as dehned above. Maclaurin and Adams (2014) 
propose Firehy Monte Carlo, which introduces an auxiliary variable for each observation, 
which determines if it is included when evaluating the likelihood. The distribution of these 
binary variables is such that when they are integrated out, the marginal posterior is the same 
as the one targeted by MH, thus providing exact simulation. Typically a small fraction of 
observations are included, hence speeding up the execution time signihcantly compared to 
MH. However, the augmentation scheme severely affects the mixing of the Firehy algorithm 
and it has been demonstrated to perform poorly compared to MH and other subsampling 


approaches (Bardenet et al., 2015; Quiroz et al. 


2017). We conclude that delayed 


acceptance, out of the discussed methods, seems to be the only feasible route to obtain exact 
simulation via data subsampling. We demonstrate that our implementation is crucial to 
obtain an algorithm that can improve on MH. 

This paper is organized as follows. Section presents the efficient likelihood estimator. 
Sections[^and|^outline the delayed acceptance MH and PMMH methodologies in the context 
of subsampling. Section [^applies the method to a micro-economic data set containing nearly 
5 million observations. Section concludes and Appendix [A] proves Theorem 


































DELAYED ACCEPTANCE MCMC BY DATA SUBSAMPLING 


5 


2. Likelihood estimators with subsets of data 

2.1. Structure of the likelihood. Let 6 be the parameter in a model with density x^), 

where yk is a potentially multivariate response vector and Xk is a vector of covariates for 
the kth observation. Let lk { 0 ) = \ ogp { yk \ 9 , Xk ) denote the kth observation’s log-density, 
k = 1, ... ,n. Given conditionally independent observations, the likelihood function can be 
written 


( 2 , 1 ) 


p{y\e) = exp 11(e )], 


where l{6) = ^ k { 9 ) is the log-likelihood function. To estimate p { y \ 0 ), we hrst estimate 

l{9) based on a sample of size m from the population {h{9 ),..., G(6*)} and subsequently use 
(ig, The hrst step corresponds to the classical survey sampling problem of estimating a 


population total: see Sarndal et ah (2003) for an introduction to survey sampling. Note that 


(2.1) is more general than identically independently distributed (iid) observations, although 


we require that the log-likelihood can be written as a sum of terms, where each term depends 
on a unique piece of data information. One example is models with a random effect for each 
subject, with possibly multiple individual observations per subject (longitudinal data). In 
this case, a single term in the log-likelihood sum corresponds to the log joint density for 
a subject, and we therefore sample subjects (rather than individual observations) when 


estimating (2.1). 


2.2. Data subsampling. A main distinction of sampling schemes is whether the sample is 
obtained with or without replacement. It is clear that sampling with replacement gives a 
higher variance for any function of the sample: including the same element more than once 
does not provide any further information about finite population characteristics. However, it 
results in sample elements that are independent which facilitates the derivation of Theorem 
[^for the delayed acceptance MH. It also allows us to apply the theory and methodology in 
for the delayed acceptance PMMH in Section It should be noted that 
the two sampling schemes are approximately the same when m <^n. 


Quiroz et ah (2016 
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Let u = (mi, ... Urn) be a vector of indices obtained by sampling m indices with replacement 
from {1,..., n} and let {lui{d)i ■ ■ ■ i be the sample. We will consider Simple Random 

Sampling (SRS) which means that 


Pr(Mj = k) = — for k = 1,... ,n, and for alH = 1,...,m. 
n 


Payne and Mallick (2015) use SRS (but without replacement) to form an unbiased estimate 
of the log-likehood. However, the elements lk{G) (for a fixed 6) vary substantially across the 
population and estimating the total Yl'k=i ^k{0) with SRS is well known to be very inefficient 
under such a scenario. Instead Pr(wj = k) should be (approximately) proportional to a size 
measure for lk{0). Quiroz et ah ( |2016 ) argue that this so called Proportional-to-Size sampling 
is in many cases unlikely to be successful as knowledge of a size measure (which also depends 
on 9) for all k can often defeat the purpose of subsampling. They instead propose to use 
SRS but incorporate control variates in the estimator for variance reduction which we now 
turn to. 


2.3. Efficient log-likelihood estimators. The idea in Quiroz et al. (2016 ) is to homogenize 
the population {h{9),... ,ln{6)}: if the resulting elements are roughly of the same size then 
SRS is expected to be efficient. Let qk{9) denote an approximation of lk{9) and decompose 


( 2 . 2 ) 


m 


k=l k=l 


q{9) + d{9), 


where 

n n 

qi0) = J2(lk{0), d{9) = Y,dk{0), and dk{9) = lk{9) - qk{0). 

k=l k=l 

We emphasize that all quantities depend on 6 which, from now on, is sometimes suppressed 
for a compact notation. 

Define the random variables r]i = nd^^ and d^^ = — q^^ with E[r]i] = d and 

n n 

= ^[Vi] = n^^[dui] = n'^idk - ^i)^ with d = ^^4. 

k=l k=l 
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The difference estimator estimates d in (2.2) with the Hansen-Hurwitz estimator (Hansen 


and Hurwitz, 1943), 


1 J \ - - cr^ 

dm = r]i, with E[dm\ = d and V[dm\ = — • 

m 


m 


Z=1 


We can obtain an unbiased estimate where is the usual unbiased sample 

variance estimator (of the population {di,... d„}). The estimator of the log-likelihood is thus 

,2 


(2.3) 


L = qk{0) + dm, with E[L] = I and cr^ = V[L] = — • 

Tn 


k=l 

'2 


It also follows that = dWm is an unbiased estimator of 


Quiroz et al.|([20l6) reason that observations close in data space {xk, Dk) should, for a fixed 


9, have similar lk{d) values. They cluster the data space into K clusters and approximate, 
within each cluster, lk{d) by a second order Taylor series expansion qk{d) around the centroid 


of the cluster. This allows computing q using K evaluations (instead of n), see Quiroz et ah 


(2016) for details. Bardenet et al. (2015) propose similar control variates, but instead expand 
with respect to 9 around a reference value 9*. While it can be shown that q can be computed 
using a single evaluation, the approximation can be inaccurate when 9 is far from 9* giving 
a large cr^. 


We note that the log-likelihood estimate in Payne and Mallick (2015) (if with replacement 


sampling is used) is a special case of (2.3), namely when qk = ^ for all k. We will see that this 
results in a poor performance of the delayed acceptance algorithm and that control variates 
are crucial for a successful implementation. 


2.4. Approximately bias-corrected likelihood estimator . It is clear that an unbiased 
log-likelihood estimator becomes biased for the likelihood when transformed to the ordinary 
scale by the exponential function. Since by the Central Limit Theorem (CLT) 


177 Qm(6') —/(6') j —>■ A/'(0, cr^) as m —)■ oo. 
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Quiroz et al. (2016) (see also Ceperley and Dewing||1999 Nicholls et al.||2012 ) approximately 
bias corrects the likelihood estimate 


(2.4) 


f)m{y\0,u) = exp (L(6') - (Tj(6')/2m) . 


Equation (2.4) is unbiased if a'i is used in place of (and normality holds for Im)- In 


practice we need to use a'i (to not use all data) and Quiroz et al. (2016) show that (2.4) is 


asymptotically unbiased, with the bias decreasing as 0{m ^). For the rest of the paper we 


refer to (2.4) as a “bias-corrected” estimator, where the quotation marks highlight that the 


correction is not exact. 


3. Delayed acceptance MH with subsets of data 


3.1. The algorithm . Payne and Mallick (2015) propose a subsampling delayed acceptance 


MH following [Christen and (2005). The aim in delayed acceptance is to simulate a 
Markov chain which admits the posterior 

piy\9)p(9) f 

^(^) =-ru-) where p(y) = / p(y\9)p(9)d9 and p(9) denotes the prior, 

p{y) J 

as invariant distribution. Moreover, the likelihood p{y\9) should only be evaluated if there 
is a good chance of accepting the proposed 9. 


The algorithm in Payne and Mallick (2015) proceeds as follows. Let 9c = 9^^^ denote the 


cnrrent state of the Markov chain. In the first stage, propose 9' ~ (li{9\9c) and compute 

Pmiy\9\u)p{9')/qi{9'\9c 


(3.1) 


C(i{9c —>■ 9') = min < 1, 


Pm {y\9c, u)p{9c)/qi{9c\9') 


where pm{y\9,u) is the estimator in Section 2.4 without control variates for (but not 
“bias-corrected”, see below). Now, propose 


9p = 


9' w.p. ax{9c,9') 

9c w.p. 1 - ai(6'c,6''). 
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and move the chain to the next state = Op with probability 

P{y\dp)pi0p)/q2{dp\0c) 


(3.2) 

where 


a 2 { 0 c —)■ Op) = min < 1, 


P{y\0c)p{0c)/q2{0c\0p) 


q2{^p\^c) — Q!i(6'c —)■ 6 'p)gi( 6 'p| 6 'c) + r(6'c)56»c(^p); ''"(^c) — 1 ~ y ctiiOc ^ ^p)qi{^p\^c)dOp, 

and 5 is the Dirac delta function. If rejected we set 0 d+i) = Oc- 

Note that a 2 in ( |3.2 ) is equivalent to the acceptance probability of a standard MH with 
a proposal density q 2 {dp\ 0 c) that is a mixture of two proposal densities: the hrst proposes 
to move from Oc to Op (from the “slab” gi) and the second proposes to stay at Oc (from the 
“spike”). The mixture weight for the “slab” is ai in (3. ^ (with 0' = Op): if the likelihood 
estimate is higher at O' compared to that of Oc (after correcting with gi) we propose O' = Op 
with probability 1. Conversely, if it is lower, we propose to move but with a probability 
smaller than 1 which decreases the less likely we think that O' will be accepted as indicated 


by the estimated likelihood. The estimator pm{y\0'^u) used by Payne and Mallick (2015) 


does not depend on the current state of the Markov chain and hence the mixture weights in 
g 2 are state-independent. Convergence to the invariant distribution therefore follows from 
standard MH theory. The same applies when the estimator uses the control variates in 
Quiroz et ah ( |2016 ), or in [Bardenet et al.| (|2015[) but only for a fixed 0*. [Bardenet et ah 


(2015) suggest setting 0* = Oc every now and then to prevent that the control variates can be 


poor if the chain is far from 0*: the resulting approximation is clearly state-dependent and 
standard MH theory does not apply. Instead, convergence to the invariant t^{0) is proved in 


Christen and Fox (2005) and the delayed algorithm is exactly as above but with 


Pm(2/|-,M) = P^m\y\'i'^)i emphasizing that it depends on the current state Oc- 


The state-dependent algorithm is a key ingredient when developing the delayed acceptance 
block PMMH in Section HI 
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As noted above, Payne and Mallick (2015) do not “bias correct” the likelihood estimate 
as they point out that the algorithm will have the correct invariant distribution anyway. 
We remark that without control variates it is not a good idea to apply the correction as 
the variance of is huge which adversely affects 'Pm{y\-,u). An efficient estimate of 
however, would certainly improve prn{y\--,u). While the control variates allow us to estimate 


d^ accurately, we will not implement the “bias-correction” when comparing to Payne and 


Mallick (2015) in order to make the comparison fair. 


It is benehcial to also update u in order to avoid the risk of having a subset of observations 
for which the approximation is poor. This is especially important for the estimator in [Payne' 


and Mallick (2015), as not using control variates can result in the particular subset having 


highly heterogeneous elements, which is detrimental for SRS. We note that updating u is still 
a valid MH because (i) u is not a state of the Markov chain and (ii) the distribution p{u) = 
l/n™ (SRS) does not depend on 6. Thus, the transition kernel of an algorithm that updates 
M is a state-independent mixture of transition kernels, where each of the kernels satishes 


detailed balance either by standard MH (or Christen and Fox (2005) if the approximation 
depends on 6c)- Since the weights 1/n”^ do not depend on 6, it follows that the mixture also 
satishes detailed balance, and thus has 7r{6) as invariant distribution. It is unnecessary to 
update u in every iteration, instead one can update u randomly to save the overhead cost of 
indexing the data matrix when obtaining the subset of observations. 

We remark that delayed acceptance (sometimes with names as early rejection or surrogate 
transition), although without data subsampling, has been considered earlier in the literature. 


References include Fox and Nicholls (1997); Liu (2008); Cui et al. (2011); Smith (2011); 


Solonen et al. (2012); Golightly et al. (2015); Sherlock et al. (2015a). Each stage in Banterle 


et al. (2014) (see Section uses a partition of the data and can thus be considered as 


delayed acceptance with data subsampling. The advantage of our algorithm is that, because 
it only has two stages and the second stage evaluates the full data likelihood, we can instead 
estimate this likelihood in order to never do the evaluation for the full data set, see Section 

SI 
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3.2. Efficiency of delayed acceptance MH when subsampling the data. When con¬ 
sidering efficiency of MCMC algorithms with additional computational costs (e.g. estimating 
the target), there are two types of fundamentally different efficiencies that interplay. The 
first is the statistical efficiency, which we will measure by the asymptotic (as the number 
of MCMC iterates go to infinity) variance of an estimate based on output from the Markov 
chain. Consider two MCMC algorithms Ai and A 2 with the same invariant distribution. 
Then Ai is statistically more (less) efficient than A 2 if it has a lower (higher) asymptotic 
variance. The second is the computational efficiency, which solely concerns “execution time” 
to produce a given number of iterates. The measures of statistical and computational effi¬ 
ciency (and a combination of them) used in this article are presented later in this section. 

Sherlock et ah ( 2015b| ) (in a non-subsampling context) study the statistical efficiency for 


delayed acceptance random walk Metropolis and, moreover, an efficiency that also takes into 
account the computational efficiency for the case where the target is estimated (DA-PMMH 
in Section]^. 


Christen and Fox (2005) note that, because the transition kernels of both the MH and 


delayed acceptance MH are derived from the same proposal qi, and in addition 02 < 1, the 
delayed acceptance MH will be less statistically efficient than MH. The intuition is that under 
these conditions the chain clearly exhibits a more “sticky” behavior and an estimate based 
on these samples will have a larger asymptotic variance under DA-MH than MH. Notice 
that the closer a 2 is to 1, the more statistically efficient the delayed acceptance algorithm 
is, and when 02 = 1 it is equivalent to the standard MH which gives the upper bound of the 
possible statistical efficiency achieved by a DA-MH. 


Result 1 in Payne and Maffick (2015) gives the alternative formulation (for state-independent 
approximat ions) 


(3.3) 


«2(6'c —t 9p) 


min 


^ Pm{y\0c,u)/p{y\0c) \ 

' Pm{y\dp,u)/p{y\ep) J ’ 
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Let lk{9c, Op) = lk{0c)-lk{0p) and denote by lm{0c, Op) the estimate of/(^c, Op) = YJk=i ^k{0c, Op 


Similarly to (2.3), 


(3.4) 


Op) 0.(0c) Op) ^ ^ ^ Ciy with (liOc) Op) ^ ^ Q'fc(^C) Op), 


i=l 


k=l 


where qk{0c, Op) = qkiO^) — qk{0p) and the O’s are iid with 

Pr (0 = nDk{0c, Op)) = with Dk = {hiOc, Op) - qk{0c, Op)) for i = 1,... m. 

n 

We can also show that 

(3.5) 

E[L( 6 'c, Op)] = l{0c, Op) and V[l,n{0^, Op)] = with (^1 = n'^ {Dk{0c, Op) - Df{0c, Op))^ , 


k=l 


where is the mean over the fnll popnlation. When not “bias-corrected”, the ratio appear¬ 


ing in (3.3) becomes 


(3,6) 


Rm exp (^mi^Oc, Op) l{0(^, Op)^ ■ 


We now propose a theorem that relates E [ 02 ( 6*0 —t 6 *^)] to the variance = y[lm{0c,0p)] 
nnder the assnmption that ImiOc, Op) ~ M{l{0c,0p),a‘F) (eqnivalently r'U logA 6 ( 0 ,a|) in 


(3.6)). In tnrn, 02 relates to the statistical efficiency as discnssed above. The assnmption of 
normality is jnstified by a standard CLT for lm{0c, Op) since the Ci’s are iid. 

Theorem 1. Suppose that we run a DA-MH with a state-independent approximation 
im{0c, Op) ~ Af {l{0c, Op),aF) , where (t\{ 0^, Op) = V[log(i?m)], 


which has a second stage acceptance probability 


c^2{0c t Op) min (1, Rm ), Rm exp (^lm{0c, Op) l(0c,, Op) 


Then 


R[a2{0,^0p)] = exp(4(0„0p)/2)(l-$(aR(0o,0p))) + O.5. 
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In particular, E[a2{0c —)• Op)] is a decreasing function of (Jr- 


Proof. See Appendix [X| 


□ 


Remark. It is possible to state and prove a similar theorem using a “bias-corrected” likelihood 
estimator, which in log-scale is lm{0c,0p) — a‘j^{6c,0p)/2. This is omitted as our application 
in Example 1 in Section does not use a “bias-corrected” likelihood estimator in order to 


conduct a fair comparison to Payne and Mallick (2015). 


Theorem says that if the variance of the log of the ratio Rm in (3.6) is lower, then the 
algorithm has a higher a 2 on average. Moreover, it shows that a 2 deteriorates quickly with 
an, which illustrates the importance of achieving a small an. 

Although the delayed acceptance MH is always less statistically efficient than MH it can 
of course still be more generally efficient in terms of balancing computational and statistical 
efficiency. Clearly, a necessary condition to achieve a more general efficient DA-MH algo¬ 
rithm than the corresponding MH requires that its computing time is faster, i.e. it must be 
computationally more efficient. When comparing DA-MH algorithms with the same com¬ 
puting time (by for example having the same subsample size) then Theorem shows that a 
smaller variance of the log-ratio results in a more general efficient algorithm. 

To also compare algorithms of different computing times we now dehne a measure for 
the general efficiency discussed above. In the rest of the paper, whenever we claim that 
algorithms are more or less efficient to each other it is based on this measure. The statistical 
(in)efficiency part is measured by the Inefficiency Factor (IF), which quantihes the amount 
by which the variance of (1/A^) '^f=i 0^^'^ is inflated when 0^^'^ is obtained by Markov chain 
simulation compared to that of iid simulation. It is given by 

OO 

(3.7) IF = 1 + 2J2pi, 

1=1 

where pi is the auto-correlation at the /th lag of the chain, and can be computed with 


the coda package in R (Plummer et ah, 2006). To include computational efficiency in the 
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measure we use Effective Draws (ED) per computing unit 


(3.8) 


ED 


N 

IF X t’ 


where N is the number of MCMC iterations and t is the computing time. The measure 
of interest is the effective draws per computing time of a delayed acceptance algorithm A 
relative to that of MH, i.e. 


(3.9) 


RED 


ED'^ 

ED^‘ 


4. Delayed acceptance PMMH with subsets of data 


4.1. PMMH for data subsampling. Quiroz et al. (2016) propose a pseudo-marginal ap¬ 


proach to subsampling based on the “bias-corrected” likelihood estimator in (2.4). In the 
next subsection this estimate replaces the true likelihood, thereby avoiding the evaluation of 
the full data set. To allow for a smaller subsample size without adversely affecting the mixing 


of the chain Quiroz et al. (2016) develop a correlated pseudo-marginal approach based on 


Deligiannidis et al. (2016). Tran et al. (2016) (see also Quiroz et ah, 2016) use an alterna¬ 


tive approach to correlated pseudo-marginal which we use for our Delayed Acceptance Block 


PMMH (DA-BPMMH) in Section 4.3 Although not considered here, it is straightforward 


to instead correlate the subsamples as in Deligiannidis et al. (2016). 

Pseudo-marginal by data subsampling targets the following posterior on an augmented 
space {9, u), 


(4.1) 


Timid, U) = Pmiy\9,u)piu)pi9)/pmiy), With Pmiy) 


Pmiy\9)pi0)d9. 


The algorithm is similar to MH except that 6 and u are proposed and accepted (or rejected) 
jointly, with probability 

1 Pmiy\9piUp)pi9p)/qi6p\6(.') ^ 

’ Pmiy\9c,Uc)pi9c)/qi9c\9p) J ’ 


(4.2) 


ttPMMH 


mm 
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where the subscripts denote the current and proposed values of Q and u. Quiroz et ah 


(2016) show that the algorithm converges to a slightly perturbed target (only approximately 
unbiased likelihood estimator) 71^(6') and prove that 

- 71 ( 0)1 


7r(0) 


< 0{m 


- 2 ^ 


Moreover, if h{6) is a function that is absolutely integrable with respect to vr(0), then 
[h{6)] is also within of its true value. 

The variance of Im is crucial for the general efficiency of the algorithm: in the approxi¬ 


mate interval [1,3.3] is optimal (Pitt et ah, 2012; Doucet et ah, 2015, Sherlock et ah, 2015c). 
The correlated pseudo-marginal approach induces a high positive correlation between the 


estimates in (4.2) by correlating Uc and Up. This allows the use of a less precise estimator 
without getting stuck: the errors in the numerator and denominator tend to cancel. As 
a consequence, can be larger than above, hence speeding up the algorithm by taking a 


smaller subsample. We follow Tran et al. (2016) and divide u (defined in Section 2.2) into 
G blocks, 

Tfl 

u = (m(i), ..., M(g)) with — observations within each block, 

(j 

and update a single block (randomly) in each iteration. Setting G large induces a high 
positive correlation p between lm{0c) and Imi^p)- 


Tran et al. 


(2016) show that p = 1 


1/G under certain assumptions. We set G = 100 for the state-dependent algorithm in our 
application. 


4.2. State-independent algorithm: DA-PMMH. Delayed acceptance PMMH uses an 
estimate of the target in the second stage of the algorithm. Such algorithms have recently 
been considered by Sherlock et al. ( 2015bpi ) and Golightly et al. ( |2015 ). We propose a state- 
independent data subsampling DA-PMMH (uncorrelated PMMH) and a state-dependent 


(block PMMH, where the correlation between the subsamples follows Tran et ah, 2016) 


extension DA-Block (correlated) PMMH (DA-BPMMH) in the next subsection. 
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The estimated “bias-corrected” likelihood in (2.4) is approximated in a first stage screening 


by s which is discussed in Section 4.4 Similarly to the algorithm in Section 3.1, we desire to 
only evaluate the (estimated) target (on the augmented space) if the proposed state is likely 
to be accepted. Let {Oc,Uc) = denote the current state of the augmented Markov 

chain. In the first stage, propose 9' ~ (li{9\9c) and u' ~ p{u) and evaluate 

s{9',u')p{9')/qi{9'\9c 


(4.3) 


a™{( 0 e,wc)^( 0 ',w)} = min<|l, 


s{9c,Uc)p{6c)/qi{9c\9') j ’ 


where s{9,u) is the approximation of Pm( 2 /| 6 ',M) (2.4). Propose 


{Op, Up) 


{0',U') W.p. «™{(0e,Mc)^(0',M')} 

{9c, Uc) w.p. 1 - {{9c, Uc) {O', u')] , 

and move to = {9p,Up) with probability 

Pm{y\9p, Up)p{9p)Iq2{9p\9c) 


(4.4) {{9c,Uc) ^ {9p,Up)] = min <( 1, 


Pm{y\9 Cl'^c)p{9c) / q2{9 c\9p) J 


where 


q2{9p\9c) = «™gi(0p|0c) + r{9c)6eX0v). r{9c) = 1- J «™gi(0,|0,)d0„ 

and 5 is the Dirac delta function. If rejected we set = {9c, Uc)- Similarly to the 

argument for the DA-MH, we recognize as the acceptance probability for a pseudo¬ 

marginal algorithm. Since the approximation is state-independent {s{9,u) independent of 
the current state implies state-independent mixture weights for ^ 2 ), convergence follows from 


being a pseudo-marginal algorithm (Andrieu and Roberts, 2009). However, as the estimate 
is biased the target is perturbed (Quiroz et ah, 2016) but within 0{m~‘^) as discussed in 
Section 14.11 


4.3. State-dependent algorithm: DA-BPMMH. As u is part of the state in a pseudo¬ 
marginal algorithm, obtaining a state-dependent approximation is easily achieved by corre¬ 
lating u\ we sample Up ~ p{u\uc) thus s{9,u) = sec,uc{^,u). This is the state-dependent (on 
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the augmented space) setup in Christen and Fox (2005) and it follows that the invariant 


distribution is Tr^iO^u) in (4.1). This is the invariant distribution in the algorithm in Quiroz 


et al. (2016), which we already mentioned has a marginal 71^(0) within 0{m~‘^) of 7^(6). We 
note that this state-dependent approximation is very convenient because it automatically al¬ 
lows us to have a higher variance on because of the block PMMH mechanism (see Section 


4.1) 


4.4. An approximation of the estimated likelihood. Our approximation is inspired by 


adaptive delayed acceptance ideas in Cui et al. (2011) and Sherlock et al. (2015a). However, 


our approach is not strictly adaptive as we only learn about the proposal for a fixed training 
period of Wrain iterations, which are discarded from the hnal draws. 

The idea is to use a sparser set of the data to construct control variates in the 

hrst stage. During the training period we learn about the discrepancy between = 

Ylk=i '?(^) second stage (obtained with a denser set), i.e. 

(4.5) q{6) — q^^\0) = f{6) -|- e, e is the noise (assumed independent of 6). 

'-V-' 

=e(e) 

Learning about / is a standard regression problem; the collection of proposed 0’s during the 
hxed training period are the inputs and the discrepancies are the training data. 

The trivial decomposition 

n 

lm{e) = Y,qk{9)+drr.{0)-al/2 

k=l 

= ^ ) +dm{9)-a‘^j2 

k=l \k=l k=l / 

suggests the hrst stage approximation 

n 

Kd , u ) = '^ q ‘;}\ 9 ) + e { 9 ) + d ^{ 9 ) - al ^/ 2 , 

k=l 


where e{9) is the prediction of the discrepancy at 9. Note that computing the true discrepancy 
requires the denser set to be evaluated, whereas the prediction is very fast. For example, in 
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linear (or non-linear by basis functions) regression the parameters of f{6) are estimated once 
after the training data has been collected. Prediction of e{6) for a new “data-observation” 
6 is then a simple dot product computation, which is typically much faster than evaluating 
q{0). Note that if the u’s are correlated then s{6,u) also depends on the current state, i.e. 

Sdc,Uc{^J '^)- 

In Section we implement both a linear regression and (noise free) Gaussian process to 
learn / in (|4.5[), but any regression technique can be used. 


5. Application 

5.1. Data and model. We model the probability of bankruptcy conditional on a set of 
covariates using a data set of 534, 717 Swedish hrms for the time period 1991-2008. We 
have in total n = 4, 748, 089 hrm-year observations. The variables included are: earnings 
before interest and taxes, total liabilities, cash and liquid assets, tangible assets, logarithm 
of deflated total sales and logarithm of hrm age in years. We also include the macroeconomic 
variables GDP-growth rate (yearly) and the interest rate set by the Swedish central bank. 


See Giordani et ah (2014) for a detailed description of the data set. 


We consider the logistic regression model 
p{yk\xk,/3) = 


Vk 


i-yfc 


1-F exp(a;^/?)/ \1 + exp{-xl/3) 

where Xk includes the variables above plus an intercept term. We set p{(3) ~ N{0, 10/) for 
simplicity. Since the bankruptcy observations {i/k = 1) are sparse in the data, we follow 


Payne and Mallick (2015) and estimate the likelihood only for the yk = 0 observations. That 


is, we decompose 

HU) = Y. '‘U)+ Y '‘C). 

{k;yk=^} {k-,yk=0} 

and evaluate the hrst term whereas a random sample is only taken to estimate the second 


term. The second term clearly follows the structure presented in Section 2.1 
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5.2. Performance evaluation measures. The quantity of interest is the effective draws as 


defined in (3.8). We now outline how to compute t as CPU time and present an alternative 
measure independent of the implementation that is based number of evaluations. We first 
outline a robust measure of the CPU time. 

The delayed acceptance algorithms we implement have two stages, where the first stage is 
filtering out draws unlikely to be accepted. One can view MH as an algorithm that does not 
filter any proposed draws at all: any draw is subject to an accept/reject decision based on 
the second stage likelihood. To make total CPU time comparisons fair between a Delayed 
Acceptance (DA) MH and its corresponding MH, we compute the total time for the latter 
(MH) by the median time the former (DA) spends in the second stage and multiply by the 
number of MCMC iterations N: 


CPUmh = N X median time DA stage 2. 

The median is used to avoid extreme values that can arise due to external disturbances (CPU 
time should be independent of 6 here). For the delayed acceptance algorithm the CPU time 
is 


CPUda = N X median time DA stage 1 + FullEval x median time DA stage 2, 

where FullEval is the number of second stage evaluations the delayed acceptance performs. 

The (total) number of evaluations measure is straightforward for MH [N x n), DA-MH 
without control variates {N xm + FullEval x n) and with {N x {K + m) + FullEval x n), and 
PMMH/BPMMH [N x [K + m)). For the delayed versions of PMMH/BPMMH we similarly 
add the different evaluations, but note that it will be different for the pre- and post- training 
period. Moreover, there is a one time cost of learning f{9) and also a (post-training) cost of 
predicting e{d). We translate these to number of evaluations as follows. First, for learning 
f{9) we measure the CPU time it takes to fit it with the training data. We compare this 
time to the average CPU time for the iterations during the training period. We do so 
because we know the measure of the latter in terms of number of evaluations: + K + 
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where and K are, respectively, the number of centroids in the sparser and denser set of 
data. Therefore, if learning f{6) is say T times slower in CPU time, then this translates to 
T ■ + K + m) number of evaluations. Finally, the number of evaluations for predicting 

a single e{6) depends on the model for f{0). For linear regression the prediction is a dot 
product which we assign the same cost as computing a single log-likelihood contribution 
(which is typically a function of a dot product). For a Gaussian process, the prediction 
requires evaluating the kernel for iVtrain observations and we let that dehne the number of 
evaluations for a single prediction. 


5.3. Implementation details. We consider the following two examples. Example 1 esti¬ 
mates the model with DA-MH implemented with our efficient control variates and compares 


it to the implementation in Payne and Mallick (2015). Example 2 estimates the model 


with DA-BPMMH and compare to the block PMMH algorithm in Quiroz et al. (2016). To 


make comparisons fair for Example 1 we use without replacement sampling (as in Payne and 


Mallick, 2015). This sampling scheme is typically used together with the Horvitz-Thompson 


estimator (Horvitz and Thompson, 1952): see Sarndal et al. (2003) on how to modify the 
formulas in Section |2.3 for without replacement sampling. 

Two main implementations of the difference estimator are considered. The first computes 
Qk with the second order term evaluated at f3, which we call dynamic. The second, which we 
call static, Exes the second order term at the optimum f3*. The dynamic approach clearly 
provides a better approximation but is more expensive to compute. For both the dynamic and 
static approaches we compare four different sparse representations of the data for computing 


q in (2.2), each with a different number of clusters. The clusters are obtained using Algorithm 


1 in Quiroz et al. (2016) on the observations for which y = 0 [A, 706, 523 observations). We 
note that, as more clusters are used to represent the data, the approximation of the likelihood 
is more accurate, although it is more expensive to compute. 

We consider a Random walk MH proposal for (3 where we learn the proposal scale during 
the first Wrain = 5,000 (and also train f{6)) iterations in order to reach an acceptance 


probability of ~ 0.23 for MH (Roberts et al., 1997) and ~ 0.10 for BPMMH. For the delayed 
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acceptance algorithms we have the same targets but for ai, i.e. the first stage acceptance 
probability. We discard the training samples and also a subsequent burn-in period of 10% of 
the remaining samples (20, 000) when doing inference. However, the computing costs (CPU 
and number of evaluations) include all N iterations. 

Finally, the delayed acceptance algorithms are implemented with an update of u with 
probability 0.01. 


5.4. Example 1: DA-MH. Tables 0 and summarize the results, respectively, for the 


difference estimator with control variates (DE) and the estimator in Payne and Mallick 


(2015) (PM). It is evident that the difference estimator has a larger second stage acceptance 
probability a 2 (for a given sample size), which is a consequence of Theorem because it 
has a lower = U[log(i?m)]- Figure 0 confirms that the normality assumptions, for the 
smallest value of m and K (the worst case scenario), are adequate for both methods. We 
also note from Table that for some sample sizes Payne and Mallick (2015) performs more 
poorly than the standard Metropolis-Hastings algorithm. One possible explanation is that 


the applications in Payne and Mallick (2015) have a small number of continuous covariates 
(one in the first application and three in the second) and the rest are binary. It is clear that 
the continuous covariate case results in more variation among the log-likelihood contributions 
which is detrimental for SRS. In this application we have eight continuous covariates which 
explains why SRS without covariates performs poorly for small sampling fractions. As an 
example, for a subsample of 0.1% of the data, not a single effective sample was obtained. 


5.5. Example 2: DA-BPMMH. Our second example explores how the state-dependent 


delayed acceptance BPMMH improves the BPMMH proposed in Quiroz et al. (2016). The 
results are presented in Table [Quiroz et al. (2016) in turn show that they outperform 
other subsampling approaches and, in particular, that many of these approaches perform 


more poorly than the standard MH (see also Bardenet et ah, 2015) in terms of efficiency 


and/or can give a very poor approximation of the posterior. Here we find that the delayed 
acceptance BPMMH is 30 times more efficient than MH, which is a huge improvement 
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Table 1. Delayed acceptance MH with control variates. The table shows some 
quantities for the static and dynamic implementation with different sparse rep¬ 
resentations of the data represented by K, which is the number of clusters (ex¬ 
pressed as % of n). For each approximation different sample sizes (0.1,1,5 in 
% of n) are considered. The quantities are the mean REDi and RED 2 in (3.9) 
measured with respect to computing time and average number of evaluations, 
respectively. Furthermore, dR is the mean (over MCMC iterations) standard 
deviation of log(Rm). Finally, ai and 02 are the acceptance probabilities in 
(3.1) and (3.3) (expressed in %), where the latter is computed conditional 
on acceptance in the first stage. The corresponding MH algorithm has an 
acceptance rate of ^23%. 


Static Dynamic 



REDi 

R,ED2 

an 

ai 

a2 

REDi 

RED 2 

VR 

Otl 

Ot2 

K = 0.03 

0.1 

0.93 

0.95 

5.90 

22 

14 

2.35 

2.56 

1.48 

23 

57 

1 

2.43 

2.65 

1.40 

22 

59 

2.23 

3.47 

0.45 

22 

85 

5 

2.09 

2.78 

0.57 

24 

82 

0.73 

3.27 

0.19 

24 

93 

K = 0.21 

0.1 

1.51 

1.52 

3.87 

21 

24 

2.87 

2.84 

0.83 

25 

74 

1 

2.81 

3.02 

0.99 

22 

69 

3.03 

3.91 

0.27 

22 

91 

5 

2.05 

2.71 

0.39 

26 

88 

0.78 

3.17 

0.11 

25 

96 

K = 0.71 

0.1 

2.24 

2.20 

2.10 

21 

46 

3.03 

3.49 

0.41 

23 

86 

1 

3.53 

3.30 

0.60 

22 

81 

1.97 

3.70 

0.13 

23 

96 

5 

2.22 

3.09 

0.26 

22 

92 

0.70 

3.28 

0.06 

24 

98 

K = 3.68 

0.1 

2.64 

2.75 

0.82 

21 

74 

1.28 

3.63 

0.13 

21 

96 

1 

3.71 

3.19 

0.25 

23 

92 

1.15 

3.57 

0.04 

21 

99 

5 

3.60 

2.93 

0.11 

23 

96 

0.60 

2.95 

0.02 

24 

99 


considering the aforementioned facts. Moreover, we hnd that the approximate posterior 
produced is very close to the true posterior (simulated by MH), as illustrated in Figure 


6. Conclusions 

We explore the use of the efficient and robust difference estimator in a delayed acceptance 
MH setting. The estimator incorporates auxiliary information about the contribution to the 
log-likelihood function while keeping the computational complexity low by operating on a 
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Table 2. Delayed acceptance MH without control variates (Payne and 


Mallick, 2015). The table shows some quantities for different sample sizes 


(0.1,1,5,50,80, in % of n) to estimate the likelihood. The quantities are the 
mean REDi and RED 2 in (3.9) measured with respect to computing time 
and number of evaluations, respectively. Furthermore, dn is the mean (over 
MCMC iterations) standard deviation of log(Rm)- Finally, ai and 02 are the 
acceptance probabilities in (3.1) and (3.3) (expressed in %), where the latter 
is computed conditional on acceptance in the first stage. The corresponding 
MH algorithm has an acceptance rate of ^23%. 



REDi 

RED 2 


cti 

Q!2 

0.1 

0.00 

0.00 

101.94 

21 

0 

1 

0.36 

0.37 

13.81 

18 

3 

5 

1.39 

1.45 

3.52 

22 

29 

50 

1.13 

1.33 

0.63 

24 

80 

80 

0.91 

1.08 

0.31 

24 

90 


Without control variates 



With control variates 



Figure 1. Addressing the normality assumption in Theorem^^ The figure 
shows a histogram of (standardized) lm,nidc,0p) in (3.4) (10,000 Monte Carlo 
replicates) without control variates (left, [Payne and Mallick 2015) and with 
control variates (right). The red solid line is the density function of a standard 
normal variable. The validation is for the smallest value of m (0.1% of n) for 
both cases and for the smallest value of K (less accurate, 0.03% of n) for the 
control variates. The values of the parameters are 9c = 0*, where 9* is the 
mode, and 9p is sampled from the proposal distribution. We have verified the 
assumption for several values of 9c and 9p (not shown here). 


sparse set of the data. We demonstrate that the estimator is more efficient than that of 


Payne and Mallick (2015) in terms of having a much lower variance. Moreover, we prove 


that a lower variability implies that the delayed acceptance algorithm is more efficient, as 
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Table 3. Delayed aeceptance block PMMH. The table shows some quantities 
for Block PMMH (BPMMH, Quiroz et ah, 2016) and delayed acceptance BP- 
MMH. The latter is implemented with a Gaussian Process (GP) and Linear 
Regression (LR) for learning f{9) to predict e{9) in (4.5). The block PMMH 
use a sample size of 0.5% which corresponds to ~ 10. The approximations 
used by BPMMH for computing the likelihood estimate is based on K = 3.68 
(expressed as % of n) number of clusters. The delayed acceptance version uses 
an approximation based on = 0.71 (expressed as % of n) in the first stage 
(and the same as BPMMH for the second stage). The quantities are the mean 
RED 2 in (3.9) measured with respect to number of evaluations. Moreover, 
tti and 02 are the acceptance probabilities in ( |3.1 ) and (3.3) (expressed in 
%), where the latter is computed conditional on acceptance in the first stage. 
The corresponding BPMMH algorithm targets an acceptance rate of ~10%, 
obtaining 8.4% in this run. 



RED 2 

0-1 

a2 

BPMMH 

14.03 



DA-BPMMH (GP) 

25.53 

9.6 

95 

DA-BPMMH (LR) 

30.19 

9.2 

99 


measured by the probability of accepting the second stage conditional that the first stage 
is accepted. In an application to modeling of hrm-bankruptcy, we hnd that the proposed 
delayed acceptance algorithm is feasible in the sense that it improves on the standard MH 


algorithm, which is sometimes not true for Payne and Mallick (2015). We argue that previous 
approaches (discussed in the introduction) of exact simulation by MGMG are either (i) only 
possible under unfeasible assumptions or (ii) very inefficient (compared to MH). We therefore 
believe that exact simulation by subsampling MGMG is only possible by a delayed acceptance 
approach, and the implementation provided here is crucial for success. 

Next, we realize that a delayed acceptance approach has the caveat of scanning the com¬ 
plete data when deciding upon final acceptance. We propose a state-dependent delayed 
acceptance that replaces the second stage evaluation with an estimate. This algorithm in¬ 
herently allows for correlating the subsamples used for estimating the likelihood, and we can 
leverage on recent advances in the pseudo-marginal literature to reduce the computational 
cost. Moreover, we show that it is a special case of the state-dependent delayed acceptance 
in Ghristen and Fox (2005) and thus convergence to an invariant distribution follows. This 


distribution is perturbed because the second stage estimate is biased, but we can control the 
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Figure 2. Kernel density estimations of marginal posteriors. The figure 
shows the marginal posteriors with the different algorithms MH (standard MH, 
solid black line), BPMMH (block PMMH, dashed blue line) and DA-BPMMH 
(delayed acceptance BPMMH, dashed red line). The approximations used by 
BPMMH for computing the likelihood estimate is based on K = 3.68 (ex¬ 
pressed as % of n) number of clusters with ~ 10. The delayed acceptance 
version uses an approximation based on = 0.71 (expressed as % of n) in 
the first stage. 


error and ensure that it is within 0{m~‘^) of the true posterior. We demonstrate that the 
approximation is very accurate and we can improve on MH by a factor of 30 in terms of a 
measure that balances statistical and computational efficiency. 
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Appendix A. Proof of Theorem 1 


Proof of Theorem 1. It follows from the normality assnmption that X = Rm log7\/(0,a|) 
with density 
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log(a:)^ , m > 0. 


The expectation of the acceptance probability a2{0c —>■ Op) with respect to X is 


E[min(l,X)] = / xf{x)dx+ / f{x)dx. 


Since median(X) = 1 we obtain f[x)dx = 0.5. Now, 
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with y = log(a;). The integrand is the pdf of F ~ a|.) and thus 

E[niin(l,X)] = exp ( 0 -^ 2 ) (1 - + 0.5. 

We now show that E[niin(l,X)] is decreasing in We have that 

^E[niin(l,X)] = exp (^^2) - aR^{aR) - , 

and we can (numerically) compute the maximum of the expression in brackets on the right 
which is ~ —0.23. Since exp((T^/ 2 ) > 0 it follows that ^^E[min(l, X)] < 0 and hence 
E[q; 2 ( 6 *c —t Op)] decreases as a function of aR. 

□ 



